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Abstract. We have recently carried out a computational campaign to investigate 
a model of coronal heating in three-dimensions using reduced magnetohydrodynamics 
(RMHD). Our code is built on a conventional scheme using the pseudo-spectral method, 
and is parallelized using MPI. The current investigation requires very long time inte- 
grations using high Lundquist numbers, where the formation of very fine current layers 
challenge the resolutions achievable even on massively parallel machines. We present 
here results of a port to Nvidia CUDA (Compute Unified Device Architecture) for hard- 
ware acceleration using graphics processing units (GPUs). In addition to a brief dis- 
cussion of our general strategy, we will report code performance on several machines 
which span a variety of hardware configurations and capabilities. These include a desk- 
top workstation with commodity hardware, a dedicated research workstation equipped 
with four Nvidia C2050 GPUs, as well as several large-scale GPU accelerated dis- 
tributed memory machines: Lincoln/NCSA, Dirac/NERSC, and Keeneland/NICS. 



1. Introduction 

The Reduced Magnetohydrodynamics Coronal Tectonics code (RMCT) solves the re- 
duced MHD equations in three dimensions using a standard pseudo-spectral semi- 
implicit scheme and predictor-corrector time stepping. The code has been tailored 
for studying the role of magnetic reconnection and singular current layers in the heat- 
ing of the so lar corona. Within the framework of Parker's model of coron al heating 



( Parker] [T972h . a recent analysis in two dimensions ( iNg & Bhattacharied l2008) demon- 



strated that when coherence times (r co /j) of imposed photospheric turbulence are much 
smaller than characteristic resistive time-scales (r#), the Ohmic dissipation scales inde- 
pendently of resistivity. While their initial 2-D RMHD treatment precluded non-linear 
effects such as instabilities and/or magnetic reconnection they further invoked a simple 
analytical argument that demonstrated that even considering these non-linear effects 
which would limit the growth of B ± , given small enough r co h, the same insensitivity 
to resistivity would be recovered. The RMCT code h as been used in a computational 



campaign to extend this analysis to three dimensions (INg et al.ll201 11) . Spanning three 



orders of magnitude in Lundquist number, this campaign requires very high resolu- 
tion to correctly resolve dissipative MHD structures, and very long time integrations 
to obtain adequate statistics. The high Lundquist number limit has proved particularly 
challenging even when parallelized on distributed memory machines. In this paper we 
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present a comprehensive reprogramming of our RMHD code for hardware acceleration 
using general purpose graphics processing units (GPUs). 

Recent years have seen the rapid emergence of graphics processing units as hard- 
ware accelerators for general purpose computation and high performance computing. 
Computational scientists have benefited from GPUs in fields as diverse as geology, 
molecular biology, weather prediction, high energy nuclear physics (lattice QCD), quan- 
tum chemistry, finance and oil exploration. 

In computational plasma astrophysics, several grou ps have reported progress using 
GPU acceleration for magnetohydrodynamics ( MHD) ( Wong et all 20091 : IWang et all 
20101 : IZinkll201lh. astrophysical gyro -kinetics dMadduri et alj|201ll ). and particle-in- 
cell simulations (Stant chev et al. 

The work we describe here is most similar to that of IStantchev et all (l2009h . Us- 
ing a G80 generation (128 stream processors single precision) NVidia GPU, compared 
with a single 3.0 GHz Intel Xeon using 1024 2 perpendicular resolution, they report a 
up to a 14x speedup for a Hasegawa-Mima equation solver and 25-30x speedup for 
a pseudo-spectral RMHD code in single precision. We describe in this paper a full 
fledged three dimensional reduced MHD production code and report code performance 
using the latest generation Nvidia GPUs on GPU equipped workstations as well as sev- 
eral distributed memory GPU accelerated machines. 



2. Reduced Magnetohydrodynamics and the Parallel Numerical Scheme 

The RMHD equations are a simplified version of MHD applicable to systems where the 
plasma is dominated by a strong guide field such that the timescales of interest are slow 
compared with the characteristic Alfven timescale ta- These restrictions also imply 
incompressibility (V • V = 0) and the exclusion of magnetosonic modes (leaving only 
the shear Alfven modes propagatin g in e 7 ). The RMHD equation s w ere first derive d 



the shear Altven modes propagating m e 7 ). the KMHU equations were nrst derived 
for the study of tokamak plasmas by Kadomt sev & Pogutsd (11974 ) and lStraussI dl976h . 



which can be written in dimensionless form as 

— + [0,Q] - — + [A, /] + vViO, (1) 
ot oz 

OA dd> , 

— + [<f>,A] = + rjV 2 ± A, (2) 
ot oz 

where A is the flux function so that the magnetic field is expressed as B = e z + B ± = 
e z + V ± A x e z ; <p is the stream function so that the fluid velocity field is expressed as 
v = V ± x e~ z ; fl = -V 2 is the z-component of the vorticity; J - -V\A is the z- 
component of the current density; and the bracketed terms are Poisson brackets such 
that, for example, [<p, A] = <p y A x -<p x A y with subscripts here denoting partial derivatives. 
The normalized viscosity v is the inverse of the Reynolds number R v , and resistivity rj 
is the inverse of the Lundquist number S . The normalization adopted in equations £[]) 
and (O is such that the magnetic field is in the unit of B z (assumed to be a constant in 
RMHD); velocity is in the unit of va = B z /(4jrp) 1 ^ 2 with a constant density p; length is 
in the unit of the transverse length scale L ± ; time t is in the unit of L ± /va ; ?7 is in the 
unit of 4kvaL ± /c 2 ; and v is in the unit of pv^L ± . 



The n umerical sc heme we employ in this work was adapted from lLongcope & Su dan 



( 1994) and lLongcopd (119931) . The simulation domain is a rectangular cartesian box of 
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Figure 1 . (a) Average energy dissipation rate for different values of r\. A is Ohmic 
dissipation, □ is viscous dissipation, is the total of the two, and O is the footpoint 
Poynting flux, (b) Average perpendicular magnetic field strength for different values 
of rj. 



(L z x L ± x L ± ), permeated by guide-field B z line-tied at both ends representing the 
photosphere. Time integration is performed with a second order predictor corrector 
method. Perpendicular dimensions are bi-periodic for a pseudo-spectral scheme using 
standard two-thirds rule de-aliasing and in the parallel dimension a second order finite 
difference method is used. The original was written in Fortran90 and parallelization 
is accomplished by domain decomposition in e z using MPI. Typical resolutions used 
for our coronal heating scaling study range from 64 2 x 32 up to 1024 2 x 128 and as 
with many pseudo-spectral schemes, the 2D FFTs dominate the computational burden 
typically consuming more than 80% of computation time for the resolutions we target. 



3. Numerical Results on Parker's Model of Coronal Heating 

It is crucial to the scaling study that we obtain good statistics averaging over time evo- 
lution in statistical steady state. As with previous long time integration studies of the 
Parker model, the runs are started with a vacuum potential field. After a time of the 
order of the resistive diffusion time, the system will evolve to a statistical steady state. 

The range of n has been extended to lower value s (with t c = 10 <s j r ) fo r about 
an order of magnitude as compared with the study in dLongcope & Sudanll 19941) which 
stopped at n ~ 10~ 3 . This extension of course requires significant increase in resolution, 
with our highest resolutio n case at 512 2 x 64 so far, as compared with 48 2 x 10 in 



dLongcope & Sudani! 19941) . The main difficulty in performing these simulations is the 
requirement to run up to hundreds or even thousands of Alfven times in order to obtain 
good statistics of the average quantities under the driving of random boundary flow. 

Fig-fflshows some of the scaling results we obtained so far. In Fig.[T](a), the time- 
averaged Ohmic dissipation rate W n (at the saturated level) for different n is plotted 
in triangles, while the viscous dissipation rate W v are plotted in squares. Note that 
W v <sc W v in general. The time-averaged Poynting flux I is also plotted in the same 
graph in circles. It is supposed to be of the same value as W theoretically, and we do see 
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Name 


CPU 


Nodes 


GPUs 


Network 


Carver/NERSC 


Nehalem(8 core)2.67 Ghz 


400 


None 


QDR 


Lincoln/NCSA 


Harpertown(4 core)2.33Ghz 


192 


96 x S2070 


SDR 


Dirac/NERSC 


Nehalem(8 core)2.67 Ghz 


44 


44 x C2050 


QDR 


UAF Workstation 


Gulftown(12 core)2.8 Ghz 


1 


4 x C2050 




Keeneland/NICS 


Westmere(6 core)2.67 Ghz 


120 


360 x C2070 


QDR 



Table 1. Specifications for Carver/NERSC and several GPU accelerated machines. 



that the differences between these two quantities are generally small in our numerical 
results, indicating acceptable accuracy. 

Due to the fact that we are doing 3D simulations, and that we need to simulate 
for a long time to obtain good statistics, so far we have only been able to extend the 
value of n to about an ord er of magnitude lower, as compared with similar studies in 
(lLongcope & Sudani 1994h . Nevertheless, we can see al ready that below 77 ~ 10 , there 
is a significant deviation from the scalings obtained in (lLongcope & Sudanll 19941) . who 
showed by numerical results and scaling analysis that both W and B ± should scale with 
n~ 1 ^ in the small n limit. We have added dotted lines in Fig.[T](a) and (b) showing the 
n~ 1 ^ 3 scaling. For more details of these nu merical results, as well as an analysis showing 
the transition o f scaling behavior from (lLongcope & Sudani 1 19941) to ours, please see 
dNg et al.ll201 lh . 



4. GPU implementation 

CUDA is a parallel computing engine developed specifically for general purpose ap- 
plications using NVidia GPUs. It allows programmers to leverage the high-throughput 
power of GPUs by programming in the familiar ANSI C language rather than resort- 
ing to reverse engineering native graphics languages (such as Open GL or Cg) to per- 
form general purpose and scientific computations. The CUDA model allows program- 
mers to delegate serial tasks required by the CPU by usual C code while extensions 
to C are provided for programming GPUs to exploit data parallelism. CUDA (now 
version 4.0 as of this writing) provides libraries for basic linear algebra (CUBLAS), 
sparse matrices (CUSPARSE), random number generation (CURAND), standard tem- 
plates (THRUST), and fast Fourier transforms (CUFFT) as well as tools for profiling 
(Co mpute Visual Profiler) and debugging (CUDA-gdb). In the present study, we fol- 
low IStantchev et all (120091) (and most all the studies mentioned previously) in adopting 
CUDA for our acceleration project. 

Our strategy for a comprehensive reprogramming of RMCT for GPU accelera- 
tion with CUDA can be summarized as follows: [1] Perform FFTs using the CUFFT 
library. The library is about an order of magnitude faster than our original FFT im- 
plementation when not considering CPU-GPU memory transfers but only several times 
faster when including them. We aim therefore to maximize the number of FFTs per 
memory transfer, and to perform intermediary tasks on the GPU. [2] Recycle memory 
of intermediate quantities. There is limited memory available on the GPU board so one 
must adequately budget memory allocated on board in such a manner as to observe [1] . 
[3] Write simple kernels for point-wise arithmetic. Point-wise arithmetic is the bread 
and butter of stencil-operati on based MHD codes, and CUD A impleme ntations of such 
codes have been published dWong et all d2009h IWang et all (120101) and|Zinjd d201ll) ). It 
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Figure 2. Strong scaling of RMCT CUDA on various large scale distributed mem- 
ory machines. 



is tempting to say that such operations take a back seat in the current pseudo-spectral 
application to FFTs, but as we have seen Amdahl's law would require that these kernels 
also see full consideration. [4] Preserve the underlying MPI decomposition. We are 
dealing here with two levels of parallelization, the first being the domain decomposi- 
tion in z and the second being the massive parallelism afforded by GPUs. The wide 
availability of multi-GPU workstations and GPU accelerated distributed memory ma- 
chines warrants the pursuit of both parallelization methods in tandem. For this code, 
we pair one CPU core to one GPU, each core responsible for one sub-cube in the the z 
domain decomposition scheme. 



5. GPU Code Performance and Discussion 

Figure[2]shows scaling results for RMCT CUDA on several GPU accelerated distributed 
memory machines. Measurements were made for each of five machines whose relevant 
specifications are summarized in Tabled] For a base-line comparison we use Carver, an 
IBM iDataplex mid-sized traditional CPU cluster at the National Energy Research Sci- 
entific Computing Center (NERSC) on which we have achieved the best performance 
thus far with the original version of the code. Carver hosts a GPU testbed cluster 
named Dirac which accelerates each of 44 nodes with a single NVidia C2050 GPU. 
A larger dedicated GPU production cluster at the National Center for Supercomput- 
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ing Applications (NCSA) was available through TeraGrid, and paired 96 S2070s (4 
GT200 GPU module) with 192 nodes. A dedicated GPU accelerated desktop worksta- 
tion was acquired for this project and is running at the University of Alaska Fairbanks 
with four C2050 GPUs. The most powerful machine on which we have tested is the 
Keeneland Initial Delivery System hosted at the National Institute for Computational 
Science (NICS) and Georgia Tech. It features 120 nodes each accelerated by 3 C2070s, 
and is to be expanded to a full production system for TeraGrid (now called XSEDE) 
using next generation NVidia GPUs in the coming year. 

As the codes currently stand, the CUDA port on Lincoln/NCSA at full scale is able 
to achieve roughly an order of magnitude speedup compared with Carver. Considering 
the performance of the code on the UAF workstation, the chip-to-chip equivalence is 
roughly 4, 8, and 16. As expected, the possibility of exposing maximal fine-grained 
parallelism given finer grids improves the potential for increasing this equivalence ratio. 
Actually observing an increase in this ratio then attests to the quality of the re-coding 
effort. At full scale on Keeneland, using the Fermi class GPUs the code is able to 
achieve up to a 30x speedup beyond what was previously achieved on Carver. 

The principal caveat of the present analysis is that we have spent much of our 
time optimizing the CUDA port, and have used as a basis of comparison, an original 
Fortran/MPI code which remains largely the same. The results reported here therefore 
should be taken as an effective speedup for a small academic coding team (consisting 
of a professor and graduate student) going from one practically deployable code to 
another. 
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